df <- read_csv("data/healthcare-dataset-stroke-data.csv", col_types = "cfdfffffddcf", na = c("Unknown", "N/A"))
# note: if smoking_status is set to factor in col_types, the na argument is not applied to it
df$smoking_status <- as_factor(df$smoking_status)
# married
df$ever_married <- factor(if_else(df$ever_married == "Yes", 1, 0))
# ID
df$id <- NULL
# re-code target so models work properly
df$stroke <- factor(ifelse(df$stroke == 1, "yes", "no"), levels = c("no", "yes"))
df

In smoking_status, ‘Unknown’ should be changed to NA.
Also, it can be ordered: never < formerly < smokes.
ever_married can be re-coded as 0/1, consistent with heart_disease and hypertension.
id can be removed.
The other columns look OK.
skim_desc <- skim(df)
skim_desc %>%
  yank("factor")

Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| gender | 0 | 1.0 | FALSE | 3 | Fem: 2994, Mal: 2115, Oth: 1 |
| hypertension | 0 | 1.0 | FALSE | 2 | 0: 4612, 1: 498 |
| heart_disease | 0 | 1.0 | FALSE | 2 | 0: 4834, 1: 276 |
| ever_married | 0 | 1.0 | FALSE | 2 | 1: 3353, 0: 1757 |
| work_type | 0 | 1.0 | FALSE | 5 | Pri: 2925, Sel: 819, chi: 687, Gov: 657 |
| Residence_type | 0 | 1.0 | FALSE | 2 | Urb: 2596, Rur: 2514 |
| smoking_status | 1544 | 0.7 | FALSE | 3 | nev: 1892, for: 885, smo: 789 |
| stroke | 0 | 1.0 | FALSE | 2 | no: 4861, yes: 249 |
The target ‘stroke’ is imbalanced!
‘smoking_status’ has a completeness rate of only 0.7.
The single ‘Other’ gender observation should be removed.
df <- df %>% filter(gender != "Other")
skim_desc %>%
  yank("numeric")

Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1.00 | 43.23 | 22.61 | 0.08 | 25.00 | 45.00 | 61.00 | 82.00 | ▅▆▇▇▆ |
| avg_glucose_level | 0 | 1.00 | 106.15 | 45.28 | 55.12 | 77.24 | 91.88 | 114.09 | 271.74 | ▇▃▁▁▁ |
| bmi | 201 | 0.96 | 28.89 | 7.85 | 10.30 | 23.50 | 28.10 | 33.10 | 97.60 | ▇▇▁▁▁ |
How is smoking_status distributed in each target class?

df %>% group_by(stroke, smoking_status) %>% summarise(N = n())

df %>% filter(is.na(bmi)) %>% group_by(stroke) %>% summarise(n.bmi = n())

GGally::ggpairs(df, aes(color = stroke, alpha = 0.2, dotsize = 0.02),
upper = list(continuous = GGally::wrap("cor", size = 2.5)),
diag = list(continuous = "barDiag")) +
scale_color_brewer(palette = "Set1", direction = -1) +
scale_fill_brewer(palette = "Set1", direction = -1)

ggplot(df, aes(stroke, age)) +
geom_boxplot(aes(fill = stroke), alpha = 0.5, varwidth = T, notch = T) +
geom_violin(aes(fill = stroke), alpha = 0.5) +
scale_fill_brewer(palette = "Set1", direction = -1) +
xlab("")

No surprises here: the older you get, the higher the chance of having a stroke.
There are observations with age well below 20, some close to 0! These are very young kids or babies. Should we even include them in the analysis? If we drop them, the model will only make predictions for adults.
Stroke in kids probably has very different causes than stroke in adults and the elderly.
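A quick tabulation makes the trade-off concrete. This is a sketch, assuming `df` as prepared above; `count_young_strokes` is a hypothetical helper name:

```r
# helper (hypothetical name): count stroke outcomes among patients under an age limit
count_young_strokes <- function(df, age_limit = 20) {
  table(subset(df, age < age_limit)$stroke)
}
# usage on the cleaned data: count_young_strokes(df)
```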
ggplot(df, aes(stroke, age)) +
geom_violin(alpha=0.3) +
geom_jitter(alpha=0.2, size=0.8, width = 0.15, height = 0.1, aes(color = gender)) +
geom_boxplot(alpha = 0.2) +
scale_color_brewer(palette = "Set2", direction = -1)

age and stroke

ggplot(df, aes(stroke, avg_glucose_level)) +
geom_boxplot(aes(fill = stroke), alpha = 0.5, varwidth = T, notch = T) +
geom_violin(aes(fill = stroke), alpha = 0.5) +
scale_fill_brewer(palette = "Set1", direction = -1) +
xlab("") +
ylab("avg glucose level")

This average glucose level is probably the result of a fasting blood sugar test.
If I correctly understand the CDC information on diabetes, a fasting value greater than 126 mg/dL is evidence of diabetes. But greater than 250? Is that realistic?
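It is easy to check how many such readings there are and which target class they fall into. A sketch, assuming `df` as above; `count_high_glucose` is a hypothetical helper name:

```r
# helper (hypothetical name): glucose readings above a plausibility limit, by class
count_high_glucose <- function(df, limit = 250) {
  table(subset(df, avg_glucose_level > limit)$stroke)
}
# usage: count_high_glucose(df)
```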
ggplot(df, aes(stroke, bmi)) +
geom_boxplot(aes(fill = stroke), alpha = 0.5, varwidth = T, notch = T) +
geom_violin(aes(fill = stroke), alpha = 0.5) +
scale_fill_brewer(palette = "Set1", direction = -1) +
xlab("")

Let’s look at these weird points.
facet_names <- c("no" = "no stroke", "yes" = "stroke")
ggplot(df, aes(age, bmi)) +
geom_point(color = "steelblue", alpha = 0.8, size = 0.5) +
facet_grid(rows = vars(stroke), labeller = as_labeller(facet_names)) +
guides()

Patients with BMI over 75 are also very young. Suspicious.
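Rather than eyeballing the facets, we can pull those rows directly. A sketch, assuming `df` as above; `bmi_outliers` is a hypothetical helper name:

```r
# helper (hypothetical name): extract the extreme-BMI rows for a closer look
bmi_outliers <- function(df, limit = 75) {
  subset(df, bmi > limit,
         select = c(age, gender, bmi, avg_glucose_level, stroke))
}
# usage: bmi_outliers(df)
```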
ggplot(df, aes(age, avg_glucose_level)) +
geom_point(aes(color = smoking_status), alpha = 0.6, size = 1) +
scale_fill_brewer(palette = "Set1", direction = -1) +
facet_grid(rows = vars(stroke), labeller = as_labeller(facet_names)) +
guides()

ggplot(df, aes(smoking_status, age)) +
geom_boxplot(aes(fill = stroke), alpha = 0.5, varwidth = T, notch = T) +
scale_fill_brewer(palette = "Set1", direction = -1) +
xlab("")

ggplot(df, aes(avg_glucose_level, bmi)) +
geom_point(aes(color = age), alpha = 0.6, size = 1) +
scale_fill_brewer(palette = "Set1", direction = -1) +
facet_grid(rows = vars(stroke), labeller = as_labeller(facet_names)) +
guides() +
xlab("avg glucose level")

BMI outliers: super high BMI but super low glucose levels? How is that possible?
Could this be bias introduced by misuse of the testing protocol, e.g. glucose measured after eating instead of before in some samples?
Again, the observations in both target classes form two distinct clusters.
gender <- df %>% group_by(stroke, gender) %>% summarize(N=n())
ggplot(gender, aes(stroke, N)) +
geom_bar(aes(fill=gender), alpha = 0.8, stat = "identity", position = "fill") +
scale_fill_brewer(palette = "Set2", direction = -1) +
ylab("proportion")

hyptens <- df %>% group_by(stroke, hypertension) %>% summarize(N = n())
ggplot(hyptens, aes(stroke, N)) +
geom_bar(aes(fill = hypertension), alpha = 0.8, stat = "identity", position = "fill") +
scale_fill_brewer(palette = "Set2", direction = -1) +
ylab("proportion")

heart <- df %>% group_by(stroke, heart_disease) %>% summarize(N = n())
ggplot(heart, aes(stroke, N)) +
geom_bar(aes(fill = heart_disease), alpha = 0.8, stat = "identity", position = "fill") +
scale_fill_brewer(palette = "Set2", direction = 1) +
ylab("proportion")

married <- df %>% group_by(stroke, ever_married) %>% summarize(N = n())
ggplot(married, aes(stroke, N)) +
geom_bar(aes(fill = ever_married), alpha = 0.8, stat = "identity", position = "fill") +
scale_fill_brewer(palette = "Set2", direction = -1) +
ylab("proportion")

work <- df %>% group_by(stroke, work_type) %>% summarize(N = n())
ggplot(work, aes(stroke, N)) +
geom_bar(aes(fill = work_type), alpha = 0.8, stat = "identity", position = "fill") +
scale_fill_brewer(palette = "Set2", direction = 1) +
ylab("proportion")

It’s good to be a child.
It’s bad to be self-employed.
residence <- df %>% group_by(stroke, Residence_type) %>% summarize(N=n())
ggplot(residence, aes(stroke, N)) +
geom_bar(aes(fill = Residence_type), alpha = 0.8, stat = "identity", position = "fill") +
scale_fill_brewer(palette = "Set2", direction = 1) +
ylab("proportion")

smoking <- df %>% group_by(stroke, smoking_status) %>% summarize(N = n())
ggplot(smoking, aes(stroke, N)) +
geom_bar(aes(fill = smoking_status), alpha = 0.8, stat = "identity", position = "fill") +
scale_fill_brewer(palette = "Set2", direction = 1) +
ylab("proportion")

df %>% filter(work_type == "children") %>%
group_by(smoking_status) %>%
summarise(N = n(),
avg.age = mean(age),
max.age = max(age),
min.age = min(age))

A lot of the NAs in smoking_status come from the ‘children’ group (see work_type). I can replace them with ‘never smoked’ during the imputation stage of the analysis.
There are several suspicious outliers (in BMI and glucose). I can safely remove BMI > 75, maybe even BMI > 60 (remember that BMI > 40 is already the highest obesity class).
What is less safe is removing non-adults (younger than 20).
On one hand they provide valid information (age is a very important predictor); on the other hand their stroke cases are really suspicious, and many predictors make no sense (or are obvious NAs) for non-adults (smoking, marital status, employment type, residence type, etc.). Model-based imputation of smoking_status in children makes no sense either; I should rather replace it with “never smoked”.
Since modelling with all predictors and observations gave me very moderate results (TPR = 1 comes with a very high FPR at a probability cutoff close to 0), I will try various ways of trimming the data.
Remove BMI > 60
df_trim <- df %>% filter(bmi <= 60 | is.na(bmi))
skim(df_trim) %>% yank("factor")

Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| gender | 0 | 1.0 | FALSE | 2 | Fem: 2985, Mal: 2111, Oth: 0 |
| hypertension | 0 | 1.0 | FALSE | 2 | 0: 4603, 1: 493 |
| heart_disease | 0 | 1.0 | FALSE | 2 | 0: 4820, 1: 276 |
| ever_married | 0 | 1.0 | FALSE | 2 | 1: 3342, 0: 1754 |
| work_type | 0 | 1.0 | FALSE | 5 | Pri: 2912, Sel: 818, chi: 687, Gov: 657 |
| Residence_type | 0 | 1.0 | FALSE | 2 | Urb: 2591, Rur: 2505 |
| smoking_status | 1540 | 0.7 | FALSE | 3 | nev: 1887, for: 884, smo: 785 |
| stroke | 0 | 1.0 | FALSE | 2 | no: 4847, yes: 249 |
4 cases with NA in smoking_status are gone.
skim(df_trim) %>% yank("numeric")

Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1.00 | 43.23 | 22.63 | 0.08 | 25.00 | 45.00 | 61.00 | 82.00 | ▅▆▇▇▆ |
| avg_glucose_level | 0 | 1.00 | 106.15 | 45.29 | 55.12 | 77.28 | 91.88 | 114.06 | 271.74 | ▇▃▁▁▁ |
| bmi | 201 | 0.96 | 28.79 | 7.56 | 10.30 | 23.50 | 28.00 | 33.00 | 59.70 | ▂▇▅▁▁ |
I will replace all NAs in the children group (work_type) with ‘never smoked’.
df_trim$smoking_status <- ifelse((df_trim$work_type == "children" & is.na(df_trim$smoking_status)),
"never smoked", as.character(df_trim$smoking_status))
df_trim$smoking_status <- as.factor(df_trim$smoking_status)
df_trim$smoking_status <- fct_relevel(df_trim$smoking_status, c("never smoked", "formerly smoked", "smokes"))
df_trim %>% filter(work_type == "children") %>%
group_by(smoking_status) %>%
summarise(N = n(),
avg.age = mean(age),
max.age = max(age),
min.age = min(age))

No NAs in children’s smoking status anymore.
The rest will be imputed using the mice package and two models:
- proportional odds model (polr) for smoking_status
- predictive mean matching (pmm) for bmi
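These per-column methods can also be pinned explicitly rather than relying on mice’s defaults. A sketch, assuming `df_trim` as above; `make.method` is mice’s helper for building the default method vector, and `imp_explicit` is a hypothetical name:

```r
library(mice)
# build the default per-column method vector, then pin the two methods we care about
meth <- make.method(df_trim)
meth["bmi"]            <- "pmm"   # predictive mean matching
meth["smoking_status"] <- "polr"  # proportional odds model (expects an ordered factor)
imp_explicit <- mice(df_trim, method = meth, printFlag = FALSE, seed = 1)
```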
library(mice)
imp_mice <- mice(df_trim)

##
## iter imp variable
## 1 1 bmi smoking_status
## 1 2 bmi smoking_status
## 1 3 bmi smoking_status
## 1 4 bmi smoking_status
## 1 5 bmi smoking_status
## 2 1 bmi smoking_status
## 2 2 bmi smoking_status
## 2 3 bmi smoking_status
## 2 4 bmi smoking_status
## 2 5 bmi smoking_status
## 3 1 bmi smoking_status
## 3 2 bmi smoking_status
## 3 3 bmi smoking_status
## 3 4 bmi smoking_status
## 3 5 bmi smoking_status
## 4 1 bmi smoking_status
## 4 2 bmi smoking_status
## 4 3 bmi smoking_status
## 4 4 bmi smoking_status
## 4 5 bmi smoking_status
## 5 1 bmi smoking_status
## 5 2 bmi smoking_status
## 5 3 bmi smoking_status
## 5 4 bmi smoking_status
## 5 5 bmi smoking_status
df_imp <- complete(imp_mice)

Number of NAs in bmi: 0
Number of NAs in smoking_status: 0
I assume the distributions should not change significantly after imputation, i.e. their shape and mean should remain roughly the same.
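That assumption can be checked numerically as well as visually. A sketch, assuming `df_trim` (with NAs) and `df_imp` (completed) as above; `imputation_shift` is a hypothetical helper name:

```r
# helper (hypothetical name): compare mean and sd of a column before/after imputation
imputation_shift <- function(before, after) {
  c(mean_before = mean(before, na.rm = TRUE), mean_after = mean(after),
    sd_before   = sd(before, na.rm = TRUE),   sd_after   = sd(after))
}
# usage: imputation_shift(df_trim$bmi, df_imp$bmi)
```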
There was no BMI imputation after trimming BMI > 60
bmi_imp_comp <- bind_rows(select(df_trim, bmi, stroke) %>% mutate(type = rep("original", nrow(df_trim))),
select(df_imp, bmi, stroke) %>% mutate(type = rep("imputed", nrow(df_imp))))
ggplot(bmi_imp_comp, aes(bmi)) +
geom_histogram(aes(fill = type), alpha = 0.8) +
facet_grid(cols = vars(stroke))

smoke_imp_comp <- bind_rows(select(df_trim, smoking_status, stroke) %>% mutate(type = rep("original", nrow(df_trim))),
select(df_imp, smoking_status, stroke) %>% mutate(type = rep("imputed", nrow(df_imp))))
ggplot(smoke_imp_comp, aes(smoking_status)) +
geom_bar(aes(fill=type), alpha=0.8, position="dodge") +
facet_grid(cols = vars(stroke)) +
xlab("")+
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

Counts increased proportionally in all smoking_status groups.
Scale and center all the numeric features (including imputed BMI)
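One caveat worth sketching before the code: the commented-out `caret::preProcess` line below exists precisely because scaling statistics should be estimated on the training split only and then reused on the test split, to avoid leaking test-set information. A base-R sketch of that idea; `scale_like` is a hypothetical helper name:

```r
# helper (hypothetical name): scale `test` using center/scale estimated from `train`
scale_like <- function(test, train) {
  sc <- scale(train)  # computes and stores the training center/scale
  scale(test,
        center = attr(sc, "scaled:center"),
        scale  = attr(sc, "scaled:scale"))
}
```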
# use caret::preProcess()
# preProcValues <- preProcess(training, method = c("center", "scale"))
df_scaled <- df_imp %>%
select(avg_glucose_level, age, bmi) %>%
scale() %>%
data.frame()

Some variables (hypertension, heart_disease, ever_married) already have a dummy form.
The rest (gender, work_type, Residence_type, smoking_status) have to be dummified.
# select vars
to_dum <- df_imp %>% select(gender, work_type, Residence_type, smoking_status)
# make an obj
dummies <- dummyVars(~ ., data = to_dum)
# apply it
df_dummy <- data.frame(predict(dummies, newdata = to_dum))
# look
head(df_dummy)

Join the scaled vars, the new dummies and the rest together.
df_proc <- bind_cols(df_scaled, df_dummy, select(df_trim, hypertension, heart_disease, ever_married, stroke))
head(df_proc)

Now we have 19 variables/features.
5-fold cross-validation repeated 10 times will be used.
ROC optimization will be used (it works better when the data are imbalanced).
# for ROC
fit_ctrl_roc <- trainControl(method = "repeatedcv",
number = 5,
repeats = 10,
allowParallel = T,
classProbs = T,
summaryFunction = twoClassSummary)

The data set is heavily imbalanced, so I will apply the Synthetic Minority Oversampling Technique (SMOTE) to the training set, but not to the test set.
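The `perc.over` / `perc.under` parameters used below can be sanity-checked with back-of-envelope arithmetic. `smote_counts` is a hypothetical helper based on my reading of the DMwR::SMOTE documentation, so treat the formulas as an approximation:

```r
# perc.over controls how many synthetic minority cases are created per original
# minority case; perc.under controls how many majority cases are kept per synthetic case
smote_counts <- function(n_min, perc.over, perc.under) {
  synth <- n_min * perc.over / 100
  c(minority = n_min + synth,
    majority = synth * perc.under / 100)
}
smote_counts(187, 1725, 106)  # ~187 stroke cases land in the 75% training split
```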
set.seed(1234)
sample_set <- createDataPartition(y = df_proc$stroke, p = .75, list = FALSE)
df_train <- df_proc[sample_set,]
df_test <- df_proc[-sample_set,]
# DMwR::SMOTE for imbalanced data: perc.over = 1725 and perc.under = 106 give a roughly 1:1 ratio
df_train_smote <- SMOTE(stroke ~ ., data.frame(df_train), perc.over = 1725, perc.under = 106)
df_train_smote %>% group_by(stroke) %>% summarise(N = n())

Now the data set is balanced.
set.seed(122)
#THREADS <- 6
#library(doParallel)
#cl <- makePSOCKcluster(THREADS)
#registerDoParallel(cl)
fit_rf <- train(stroke ~ .,
data = df_train_smote,
metric = "ROC",
method = "rf",
trControl = fit_ctrl_roc,
tuneGrid = expand.grid(.mtry = c(7.5, 8.5, 9.0, 9.5)),
verbosity = 0,
ntree = 25,
nodesize = 1,
verbose = FALSE)
#stopCluster(cl)
fit_rf

## Random Forest
##
## 6735 samples
## 19 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 5388, 5387, 5388, 5389, 5388, 5387, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 7.5 0.9910042 0.9856639 0.9475039
## 8.5 0.9912201 0.9844767 0.9477711
## 9.0 0.9911390 0.9844171 0.9473852
## 9.5 0.9912828 0.9848321 0.9469396
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 9.5.
imp_vars_rf <- varImp(fit_rf)
plot(imp_vars_rf, main = "Variable Importance with RF")

Here’s a function to extract data for the ROC-related plots below.
get_roc <- function(fit.obj, testing.df){
pred_prob <- predict.train(fit.obj, newdata = testing.df, type = "prob")
pred_roc <- prediction(predictions = pred_prob$yes, labels = testing.df$stroke)
perf_roc <- performance(pred_roc, measure = "tpr", x.measure = "fpr")
return(list(perf_roc, pred_roc))
}

# calculate ROC
perf_pred <- get_roc(fit_rf, df_test)
perf_rf <- perf_pred[[1]]
pred_rf <- perf_pred[[2]]
# take AUC
auc_rf <- round(unlist(slot(performance(pred_rf, measure = "auc"), "y.values")), 3)
# plot
plot(perf_rf, main = "Random Forest ROC curve", col = "steelblue", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
legend(x = 0.7, y = 0.3, legend = paste0("AUC = ", auc_rf))

Ideally, we should be able to predict all stroke cases (TPR = 1.0).
The model’s probability cutoff should be adjusted to ‘catch’ all stroke cases.
At which probability cutoff do we get TPR = 1.0?
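That cutoff can be read off directly from the predicted probabilities: the largest cutoff that still yields TPR = 1 sits at the smallest predicted probability among the true positives. A sketch; `tpr_one_cutoff` is a hypothetical helper name:

```r
# helper (hypothetical name): largest cutoff that still catches every positive case
tpr_one_cutoff <- function(probs, labels, positive = "yes") {
  min(probs[labels == positive])
}
# usage sketch: tpr_one_cutoff(predict(fit_rf, df_test, type = "prob")$yes, df_test$stroke)
```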
# use pred_rf (pred_roc) object
plot(performance(pred_rf, measure = "tpr", x.measure = "cutoff"),
col = "steelblue",
ylab = "Rate",
xlab = "Probability cutoff")
plot(performance(pred_rf, measure = "fpr", x.measure = "cutoff"),
add = T, col = "red")
legend(x = 0.65,y = 0.7, c("TPR (Recall)", "FPR (1-Spec)"),
lty = 1, col = c('steelblue', 'red'), bty = 'n', cex = 1, lwd = 2)
#abline(v = 0.02, lwd = 2, lty=6)
title("RF")

The TPR curve falls too quickly, while the FPR curve does not fall quickly enough.
The ideal cutoff should maximize TPR (i.e. Sensitivity/Recall) while minimizing FPR.
In this case the ideal cutoff is extremely low, which invalidates the model.
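One common way to make “maximize TPR, minimize FPR” precise is Youden’s J statistic (J = TPR − FPR), whose maximizing cutoff is often taken as the ideal threshold. A base-R sketch; `youden_cutoff` is a hypothetical helper name:

```r
# helper (hypothetical name): cutoff maximizing Youden's J = TPR - FPR
youden_cutoff <- function(probs, labels, positive = "yes") {
  cuts <- sort(unique(probs))
  j <- vapply(cuts, function(cc) {
    pred <- probs >= cc
    tpr <- sum(pred & labels == positive) / sum(labels == positive)
    fpr <- sum(pred & labels != positive) / sum(labels != positive)
    tpr - fpr
  }, numeric(1))
  cuts[which.max(j)]
}
# usage sketch: youden_cutoff(predict(fit_rf, df_test, type = "prob")$yes, df_test$stroke)
```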
# predict probabilities
pred_prob_rf <- predict(fit_rf, newdata = df_test, type = "prob")
# choose your cut-off
cutoff <- 0.01
# turn probabilities into classes
pred_class_rf <- ifelse(pred_prob_rf$yes > cutoff, "yes", "no")
pred_class_rf <- as.factor(pred_class_rf)
cm_rf <- confusionMatrix(data = pred_class_rf,
reference = df_test$stroke,
mode = "everything",
positive = "yes")
cm_rf

## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 572 2
## yes 639 60
##
## Accuracy : 0.4965
## 95% CI : (0.4686, 0.5243)
## No Information Rate : 0.9513
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0749
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.96774
## Specificity : 0.47234
## Pos Pred Value : 0.08584
## Neg Pred Value : 0.99652
## Precision : 0.08584
## Recall : 0.96774
## F1 : 0.15769
## Prevalence : 0.04870
## Detection Rate : 0.04713
## Detection Prevalence : 0.54910
## Balanced Accuracy : 0.72004
##
## 'Positive' Class : yes
##
The chosen cutoff is 0.01.
With this cutoff we get very low Accuracy/Kappa, Specificity and PPV.
The extremely low PPV means that when the model predicts a stroke, the chance that the patient really has one is only ~0.09.
Is this model useful then? Not much. Sensitivity and NPV are high, though, which means that when the model predicts ‘no stroke’, there is indeed no stroke with probability ~0.99.
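The quoted PPV and NPV fall straight out of the confusion-matrix counts above; a quick arithmetic check:

```r
# counts from the confusion matrix above (positive class = "yes")
tp <- 60; fp <- 639; tn <- 572; fn <- 2
ppv <- tp / (tp + fp)  # P(stroke | predicted stroke)
npv <- tn / (tn + fn)  # P(no stroke | predicted no stroke)
round(c(ppv = ppv, npv = npv), 3)
```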
This xgbTree setup has 7 tuning parameters.
I have already done a standard hyper-parameter grid search for this model (with tuneLength = 4).
The code below uses the optimal set of hyper-parameters found there.
set.seed(121)
# the best tune achieved by using tuneLength=4 is as the following:
# n_rounds = 50
# max_depth = 4
# eta = 0.3
# gamma = 0
# colsample_bytree = 0.6
# min_child_weight = 1
# subsample = 1
fit_xgb <- train(stroke ~ .,
data = df_train_smote,
method = "xgbTree",
metric = "ROC",
trControl = fit_ctrl_roc,
tuneGrid = expand.grid(.nrounds = 50,
.max_depth = 4,
.eta = 0.25,
.gamma = 0.01,
.colsample_bytree = 0.6,
.min_child_weight = 1,
.subsample = 1),
nthreads = 16,
verbose = FALSE,
verbosity = 0)
fit_xgb$bestTune

imp_vars_xgb <- varImp(fit_xgb)
plot(imp_vars_xgb, main = "Variable Importance with XGB")

# calculate ROC
perf_pred_xgb <- get_roc(fit_xgb, df_test)
perf_xgb <- perf_pred_xgb[[1]]
pred_xgb <- perf_pred_xgb[[2]]
# take AUC
auc_xgb <- round(unlist(slot(performance(pred_xgb, measure = "auc"), "y.values")), 3)
# plot
plot(perf_xgb, main = "xgbTree ROC curve", col = "steelblue", lwd = 3)
abline(a = 0, b = 1, lwd = 3, lty = 2, col = 1)
legend(x = 0.7, y = 0.3, legend = paste0("AUC = ", auc_xgb))

This ROC curve looks much better than the Random Forest one: it has a section with TPR = 1.0 and FPR < 0.5.
# use pred_xgb object
plot(performance(pred_xgb, measure = "tpr", x.measure = "cutoff"),
col = "steelblue",
ylab = "Rate",
xlab = "Probability cutoff")
plot(performance(pred_xgb, measure = "fpr", x.measure = "cutoff"),
add = T, col = "red")
legend(x = 0.6,y = 0.7, c("TPR (Recall)", "FPR (1-Spec)"),
lty = 1, col = c('steelblue', 'red'), bty = 'n', cex = 1, lwd = 2)
#abline(v = 0.1, lwd = 2, lty=6)
title("xgbTree")

The optimal cutoff here is still very low; a high Sensitivity will come with low Specificity and PPV.
pred_prob_xgb <- predict(fit_xgb, newdata=df_test, type = "prob")
# choose your cut-off
cutoff <- 0.025
# turn probabilities into classes
pred_class_xgb <- ifelse(pred_prob_xgb$yes > cutoff, "yes", "no")
pred_class_xgb <- as.factor(pred_class_xgb)
cm_xgb <- confusionMatrix(data = pred_class_xgb,
reference = df_test$stroke,
mode = "everything",
positive = "yes")
cm_xgb

## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 647 7
## yes 564 55
##
## Accuracy : 0.5515
## 95% CI : (0.5236, 0.579)
## No Information Rate : 0.9513
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0801
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.88710
## Specificity : 0.53427
## Pos Pred Value : 0.08885
## Neg Pred Value : 0.98930
## Precision : 0.08885
## Recall : 0.88710
## F1 : 0.16153
## Prevalence : 0.04870
## Detection Rate : 0.04321
## Detection Prevalence : 0.48625
## Balanced Accuracy : 0.71068
##
## 'Positive' Class : yes
##
The chosen cutoff is 0.025.
This model performs better than the previous one, but still not well enough: we get a lot of false positives.
My guess is that lifestyle is a poor predictor of stroke. To build better-performing models, we need more data on the physiology, biochemistry or genetics of the patients.
# save it
save.image("data/workspace.RData")